Introduction
The primary goal of this work is to explore and forecast rainfall in Bangalore Rural district. The secondary goal is to compare the forecast of a simple SARIMA model with that of Prophet.
Explore data
In [1]:
from IPython.display import display, Markdown
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.offline as po
import plotly.graph_objects as go
from prophet import Prophet
from scipy import stats
from sklearn.metrics import mean_absolute_error, root_mean_squared_error
from sklearn.model_selection import train_test_split
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.stattools import adfuller
po.init_notebook_mode()
rain_data = pd.read_csv('daily-rainfall-data-district-level.csv', parse_dates=["date"])
In [2]:
# Sanity check
rain_data.describe()
Out[2]:
| | id | date | state_code | district_code | actual | rfs | normal | deviation |
|---|---|---|---|---|---|---|---|---|
| count | 3.577399e+06 | 3577399 | 3.577399e+06 | 3.577399e+06 | 3.437512e+06 | 3.535587e+06 | 2.911744e+06 | 2.431767e+06 |
| mean | 1.788699e+06 | 2016-10-15 22:55:50.544908032 | 1.774312e+01 | 3.364497e+02 | 3.401141e+00 | 5.302644e-01 | 3.569884e+00 | 3.624293e+01 |
| min | 0.000000e+00 | 2009-01-01 00:00:00 | 1.000000e+00 | 1.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | -1.000000e+02 |
| 25% | 8.943495e+05 | 2012-11-23 00:00:00 | 9.000000e+00 | 1.630000e+02 | 0.000000e+00 | 0.000000e+00 | 2.000000e-01 | -1.000000e+02 |
| 50% | 1.788699e+06 | 2016-10-16 00:00:00 | 1.900000e+01 | 3.220000e+02 | 0.000000e+00 | 0.000000e+00 | 1.100000e+00 | -1.000000e+02 |
| 75% | 2.683048e+06 | 2020-09-08 00:00:00 | 2.400000e+01 | 4.790000e+02 | 1.630000e+00 | 1.758328e-01 | 5.200000e+00 | -4.588500e+01 |
| max | 3.577398e+06 | 2024-07-31 00:00:00 | 3.800000e+01 | 9.999000e+03 | 4.929900e+02 | 2.806424e+02 | 1.548000e+02 | 1.745600e+05 |
| std | 1.032706e+06 | NaN | 9.556991e+00 | 4.266907e+02 | 1.011035e+01 | 2.031809e+00 | 5.599918e+00 | 8.467325e+02 |
Here is a brief description of these columns:
- id: Row Identifier
- state_code and district_code: Codes assigned by the India Meteorological Department
- actual: Amount of rainfall recorded (in mm)
- rfs: Amount of rainfall storage (in mm)
- normal: Expected amount of rainfall (in mm)
- deviation: Deviation between rfs and normal
Out of these columns, we are interested in the "actual" rainfall for the forecast.
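Since the forecast targets a single district, the "actual" column eventually has to be reduced to one daily series. The sketch below shows one way to do that on a toy frame. The helper name `district_series` and the codes `1` and `10` are hypothetical placeholders, not the real codes for Bangalore Rural, which should be looked up in the dataset.

```python
import pandas as pd


def district_series(df: pd.DataFrame, state_code: int, district_code: int) -> pd.Series:
    """Return the daily 'actual' rainfall for one district, indexed by date."""
    mask = (df["state_code"] == state_code) & (df["district_code"] == district_code)
    return df.loc[mask].set_index("date")["actual"].sort_index()


# Toy frame with the same column names as rain_data (values are made up).
toy = pd.DataFrame(
    {
        "date": pd.to_datetime(["2020-01-02", "2020-01-01", "2020-01-01"]),
        "state_code": [1, 1, 2],
        "district_code": [10, 10, 20],
        "actual": [5.0, 0.0, 7.5],
    }
)

# Hypothetical codes; replace with the real ones for the district of interest.
series = district_series(toy, state_code=1, district_code=10)
print(series)
```

Sorting by date up front keeps the series ready for time-based operations such as resampling or a chronological train/test split.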
In [3]:
# Missing data analysis
total_rows = len(rain_data)  # count() skips NaNs, so use the full row count as the denominator
missing_actual = rain_data['actual'].isnull().sum()
percent_missing = missing_actual / total_rows
display(Markdown(f"### Missing rows: {missing_actual}/{total_rows}. Percentage: {percent_missing:.2%}"))
Missing rows: 139887/3577399. Percentage: 3.91%
We will need to replace these missing values when cleaning the data.
In [4]:
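How the gaps should be filled is a modelling choice that the notebook has not made at this point. As a minimal sketch, here are two common options on a made-up daily series: treating a missing reading as a dry day, or interpolating between neighbouring days. Both are assumptions for illustration, not the method used later in this work.

```python
import numpy as np
import pandas as pd

# Toy daily series standing in for one district's "actual" column (values are made up).
toy = pd.DataFrame(
    {
        "date": pd.date_range("2020-01-01", periods=6, freq="D"),
        "actual": [0.0, 2.0, np.nan, 4.0, np.nan, 0.0],
    }
).set_index("date")

# Option A (assumption): a missing reading means no rain was recorded.
filled_zero = toy["actual"].fillna(0.0)

# Option B (assumption): interpolate linearly in time between neighbouring days.
filled_interp = toy["actual"].interpolate(method="time")

print(pd.DataFrame({"zero_fill": filled_zero, "time_interp": filled_interp}))
```

Zero-filling keeps the many dry days intact but can understate rainfall if gaps fall in wet spells. Interpolation smooths over gaps but invents rain on days that may have been dry. Which trade-off is acceptable depends on where the missing values cluster.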
# Distribution analysis
fig = px.histogram(rain_data, x='actual', nbins=30, title="Distribution of Actual Rainfall")
fig.update_layout(xaxis_title="Actual Rainfall (mm)", yaxis_title="Frequency")
fig.show()
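The `describe()` output above shows a median of 0 mm against a maximum near 493 mm, so a 30-bin histogram on the raw scale is likely to be dominated by its first bin. A few summary numbers and a `log1p` transform can make the shape easier to read. The sketch below uses a made-up series, and the helper name `summarize_rainfall` is hypothetical.

```python
import numpy as np
import pandas as pd


def summarize_rainfall(values: pd.Series) -> dict:
    """Summarize a zero-heavy rainfall series, ignoring missing values."""
    clean = values.dropna()
    return {
        "share_dry_days": float((clean == 0).mean()),
        "median_mm": float(clean.median()),
        "max_mm": float(clean.max()),
        "skewness": float(clean.skew()),
    }


# Made-up values standing in for rain_data['actual'].
toy_actual = pd.Series([0.0, 0.0, 0.0, 0.0, 1.5, 3.0, 12.0, 80.0, np.nan])

summary = summarize_rainfall(toy_actual)

# log1p maps 0 to 0 and compresses the long right tail, which helps when plotting.
log_values = np.log1p(toy_actual.dropna())

print(summary)
```

On the real data, `log_values` could be passed to `px.histogram` in place of the raw column, with the axis title adjusted to say that the scale is log(1 + mm).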